Gradient Descent

A blog post illustrating the key techniques of gradient descent.
Author

Mia Tarantola

Published

March 9, 2023


Introduction

We are looking at the use of gradient descent for optimization, applied to the logistic loss. In this assignment I implement gradient descent for logistic regression, implement stochastic gradient descent, and perform several experiments that test the limits of these algorithms.

link to gradient descent code

Importing Modules

import importlib
import gradient_descent
from gradient_descent import LogisticRegression 

%load_ext autoreload
%autoreload 2

from sklearn.datasets import make_blobs, make_circles
from matplotlib import pyplot as plt
import numpy as np
np.seterr(all='ignore') 
{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

Gradient Descent

np.random.seed(8674)

n = 100
p_features = 3

X1, y1 = make_blobs(n_samples = 100, n_features = p_features-1,centers=[(-1.7,-1.7),(1.7,1.7)])

fig1 = plt.scatter(X1[:,0], X1[:,1], c = y1)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

LR1 = LogisticRegression()
LR1.fit(X1, y1, alpha = 0.1, max_epochs = 1000)

The fit method begins by generating a random weight vector of size (number of features + 1). We then append a column of ones to X to create our X_ array, so the last weight acts as the bias term. We set the initial previous loss to infinity so it is guaranteed to update after the first epoch. Then, until the loss converges or we reach the maximum number of epochs, we do the following (a rough sketch of this loop appears after the list):

  • update w via the gradient step \(w \leftarrow w - \alpha\, \nabla L(w)\), where \(L\) is the logistic loss
  • calculate the loss of the current state
  • update the loss history and score history
  • set the previous loss = new loss
  • the loss, predict, score, and sigmoid functions were adapted from the lecture notes and the perceptron blog post
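
To make this concrete, here is a rough sketch of what such a fit loop can look like, written directly in numpy. It is not the exact code of the LogisticRegression class (that lives in the linked gradient_descent module); the tol tolerance and the exact convergence test are assumptions chosen for illustration.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_loss(w, X_, y):
    # mean cross-entropy (logistic) loss
    p = sigmoid(X_ @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X_, y):
    # gradient of the mean logistic loss
    return X_.T @ (sigmoid(X_ @ w) - y) / X_.shape[0]

def fit(X, y, alpha = 0.1, max_epochs = 1000, tol = 1e-8):
    X_ = np.append(X, np.ones((X.shape[0], 1)), axis = 1)  # pad X with a bias column
    w = np.random.rand(X_.shape[1])                        # random initial weight vector
    prev_loss = np.inf
    loss_history, score_history = [], []
    for _ in range(max_epochs):
        w -= alpha * gradient(w, X_, y)                    # gradient step
        new_loss = logistic_loss(w, X_, y)
        loss_history.append(new_loss)
        score_history.append(np.mean((sigmoid(X_ @ w) > 0.5) == y))  # accuracy
        if np.isclose(new_loss, prev_loss, atol = tol):    # loss has converged
            break
        prev_loss = new_loss
    return w, loss_history, score_history
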
def draw_line(w, x_min, x_max):
  # plot the decision boundary w[0]*x + w[1]*y + w[2] = 0
  x = np.linspace(x_min, x_max, 101)
  y = -(w[0]*x + w[2])/w[1]

  plt.plot(x, y, color = "black")

fig1 = plt.scatter(X1[:,0], X1[:,1], c = y1)
fig1 = draw_line(LR1.w, -2, 2)

xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

print(LR1.loss_history[-3])
print(LR1.score_history[-1])
0.011075715117632856
1.0

We can see that this line is a good separator as our loss is small and our score is 1.

Stochastic Gradient Descent

np.random.seed(8674)

X2, y2 = make_blobs(n_samples = 100, n_features = p_features-1,centers=[(-1.7,-1.7),(1.7,1.7)])

fig2 = plt.scatter(X2[:,0], X2[:,1], c = y2)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

LR2 = LogisticRegression()
LR2.fit_stochastic(X2, y2, alpha = 0.1, max_epochs = 1000, batch_size = 15)

The stochastic fit function begins similarly to the original gradient descent function:

  • create X_
  • generate a random weight vector of size (number of features + 1)
  • set the previous loss to infinity

Then we iterate through the following until we reach the maximum number of epochs or the loss converges:

  • shuffle the points randomly
  • take the first k points of the shuffled order and update the weight vector using the gradient computed on that batch
  • take the next batch of points and repeat until every point has been used
  • update the loss and score history
  • reshuffle the points and iterate again
  • the gradient function is the same as the original one; we are just feeding it a subset of X_ and y_ repeatedly (see the sketch below)
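
As a rough illustration (reusing the sigmoid and logistic_loss helpers from the sketch above, and again not the exact class code), the mini-batch loop might look like this:

def fit_stochastic(X, y, alpha = 0.1, max_epochs = 1000, batch_size = 10, tol = 1e-8):
    X_ = np.append(X, np.ones((X.shape[0], 1)), axis = 1)  # pad X with a bias column
    n = X_.shape[0]
    w = np.random.rand(X_.shape[1])
    prev_loss = np.inf
    loss_history, score_history = [], []
    for _ in range(max_epochs):
        order = np.random.permutation(n)                   # shuffle the points
        for start in range(0, n, batch_size):              # walk through the batches
            batch = order[start:start + batch_size]
            x_b, y_b = X_[batch], y[batch]
            # same gradient as before, computed only on this batch
            grad = x_b.T @ (sigmoid(x_b @ w) - y_b) / x_b.shape[0]
            w -= alpha * grad
        new_loss = logistic_loss(w, X_, y)                 # full-data loss, once per epoch
        loss_history.append(new_loss)
        score_history.append(np.mean((sigmoid(X_ @ w) > 0.5) == y))
        if np.isclose(new_loss, prev_loss, atol = tol):    # loss has converged
            break
        prev_loss = new_loss
    return w, loss_history, score_history
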
fig2 = plt.scatter(X2[:,0], X2[:,1], c = y2)
fig2 = draw_line(LR2.w, -2, 2)

xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

print(LR2.loss_history[-3])
print(LR2.loss_history[-2])
print(LR2.loss_history[-1])
print(LR2.score_history[-1])
0.0033704921863074765
0.003368276616639715
0.0033660810616696095
1.0

This is a good separator, as our loss is close to 0 and our score is 1.

Comparing methods

LR = LogisticRegression()
LR.fit_stochastic(X2, y2, 
                  max_epochs = 10000, 
                  batch_size = 10, 
                  alpha = .05) 

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient (a=0.05)")

LR = LogisticRegression()
LR.fit_stochastic(X2, y2, 
                  max_epochs = 10000, 
                  batch_size = 10, 
                  alpha = .1)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient a = .1")

LR = LogisticRegression()
LR.fit(X2, y2, alpha = .05, max_epochs = 10000)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "gradient")

plt.loglog()

legend = plt.legend() 

For these settings, we can see that the stochastic methods perform much better than plain gradient descent. All three methods show a smooth decline in loss, but the stochastic gradient runs seem to have the fastest decline. I believe plain gradient descent would need more epochs to find a comparably good solution.

Experiment: alpha is too large

np.random.seed(4001)

n = 100
p_features = 3

X3, y3 = make_blobs(n_samples = 100, n_features = p_features-1,centers=[(-1.7,-1.7),(1.7,1.7)])

fig1 = plt.scatter(X3[:,0], X3[:,1], c = y3)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

#alpha is too large
LR3 = LogisticRegression()
LR3.fit(X3, y3, alpha = 200, max_epochs = 10000)

#alpha is a normal value
LR5 = LogisticRegression()
LR5.fit(X3, y3, alpha = .01, max_epochs = 10000)
#alpha = 200
fig3 = plt.scatter(X3[:,0], X3[:,1], c = y3)
fig3 = draw_line(LR3.w, -2, 2)

xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")
LR3.loss_history
LR3.w
array([125.42362443, 123.04901387, -14.35961152])

#alpha = .01
fig3 = plt.scatter(X3[:,0], X3[:,1], c = y3)
fig3 = draw_line(LR5.w, -2, 2)

xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")


num_steps1 = len(LR3.loss_history)
plt.plot(np.arange(num_steps1) + 1, LR3.loss_history, label = "(a=200)")

num_steps2 = len(LR5.loss_history)
plt.plot(np.arange(num_steps2) + 1, LR5.loss_history, label = "(a=0.01)")

legend = plt.legend() 

We can see that both alphas result in good separators. However, we have discovered one of the caveats of logistic regression on linearly separable data: the loss can always be decreased further by scaling up all of the weights, so the model can absorb a step size that is far too large simply by inflating the weights. In other words, for linearly separable data the model performs about as well as you want it to. We will therefore now investigate data that is not linearly separable.
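
As a quick sketch of why this happens (writing \(x_i\) for a row of X_, including its bias entry, and \(\sigma\) for the sigmoid), the logistic loss is

\[ L(w) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log \sigma(x_i \cdot w) + (1-y_i)\log\big(1-\sigma(x_i \cdot w)\big) \Big]. \]

If \(w\) separates the data, i.e. \(x_i \cdot w > 0\) exactly when \(y_i = 1\), then replacing \(w\) with \(c\,w\) for any \(c > 1\) pushes every \(\sigma(c\, x_i \cdot w)\) closer to its correct label, so \(L(c\,w) < L(w)\) and \(L(c\,w) \to 0\) as \(c \to \infty\). A huge step size therefore mostly just inflates \(\lVert w \rVert\) without ruining the direction of the decision boundary, which is consistent with the large weights printed above for \(\alpha = 200\).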

Non-linearly separable data

np.random.seed(8680)

n = 100
p_features = 4

X4, y4 = make_blobs(n_samples = 100, n_features = p_features-1,centers=[(-1,-1),(1,1)])

fig1 = plt.scatter(X4[:,0], X4[:,1], c = y4)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")

LR6 = LogisticRegression()
LR6.fit(X4, y4, alpha = 100, max_epochs = 10000)

LR7 = LogisticRegression()
LR7.fit(X4, y4, alpha = 1, max_epochs = 10000)
fig3 = plt.scatter(X4[:,0], X4[:,1], c = y4)
fig3 = draw_line(LR6.w, -2, 2)

fig3 = plt.scatter(X4[:,0], X4[:,1], c = y4)
fig3 = draw_line(LR7.w, -2, 2)


num_steps1 = len(LR6.loss_history)
plt.plot(np.arange(num_steps1) + 1, LR6.loss_history, label = "(a=100)")


num_steps1 = len(LR7.loss_history)
plt.plot(np.arange(num_steps1) + 1, LR7.loss_history, label = "(a=1)")


legend = plt.legend() 

Here we can see that our model performs fairly well with the smaller alpha (1), as the line mostly separates the data. However, with the larger alpha the algorithm does not work well: it is unable to converge, the loss bounces around, and the result is a bad separator.

Experiment: batch size

np.random.seed(8680)

n = 100
p_features = 4

X4, y4 = make_blobs(n_samples = 100, n_features = p_features-1,centers=[(-1.7,-1.7),(1.7,1.7)])
LR = LogisticRegression()
LR.fit_stochastic(X4, y4, 
                  max_epochs = 10000, 
                  batch_size = 8, 
                  alpha = .1) 

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient (batch_size=8)")

LR = LogisticRegression()
LR.fit_stochastic(X4, y4, 
                  max_epochs = 10000, 
                  batch_size = 80, 
                  alpha = .1)

num_steps = len(LR.loss_history)
plt.plot(np.arange(num_steps) + 1, LR.loss_history, label = "stochastic gradient (batch_size=80)")


plt.loglog()

legend = plt.legend() 

For this experiment I compared a stochastic batch size of 8 to a batch size of 80. We can see that the smaller batch size allows the algorithm to converge much faster: increasing the batch size by a factor of 10 increased the number of epochs needed by almost a factor of 10.